The main aim is to predict breast cancer patients' chance of survival.
- Clean the data
- Augment the data
- Create some plots
- Perform statistical analysis
- Create the prediction tool
We are working with a breast cancer dataset obtained from the Kaggle website.
This is a summary of the dataset:

    ## patient_id   education   id_healthcenter   id_treatment_region
    ## 111035895969: 1   Diploma :265   1110000154: 17   1110000329:321
    ## 111035896483: 1   Elementary :150   1110000280: 13   1110000330:305
    ## 111035897677: 1   Middle School:122   1110000303: 13   1110000331:213
    ## 111035897739: 1   Bachelor : 91   1110000305: 11
    ## 111035897959: 1   Illiterate : 89   1110000181: 10
    ## 111035898042: 1   High School : 65   1110000225: 10
    ## (Other) :833   (Other) : 57   (Other) :765

    ## hereditary_history   birth_date   age   weight
    ## 0:359   Min. :1939   Min. : 1.00   Min. : 6.0
    ## 1:480   1st Qu.:1979   1st Qu.:28.00   1st Qu.: 69.0
    ## Median :1986   Median :33.00   Median : 78.0
    ## Mean :1984   Mean :35.14   Mean : 75.1
    ## 3rd Qu.:1991   3rd Qu.:40.00   3rd Qu.: 86.0
    ## Max. :2018   Max. :80.00   Max. :101.0
    ## NA's :2

    ## thickness_tumor   marital_status   marital_length   pregnency_experience
    ## Min. :0.0100   0:201   above 10 years:446   0:205
    ## 1st Qu.:0.4000   1:638   under 10 years:393   1:634
    ## Median :0.6000
    ## Mean :0.5747
    ## 3rd Qu.:0.8000
    ## Max. :1.3000

    ## giving_birth   age_FirstGivingBirth   abortion   blood   taking_heartMedicine
    ## 1 :400   above 30:466   0:686   A+ :199   0:317
    ## 0 :198   under 30:373   1:153   A- :139   1:522
    ## 2 :131   AB+ :136
    ## 3 : 79   B+ :122
    ## 4 : 14   AB- : 86
    ## 5 : 12   (Other):156
    ## (Other): 5   NA's : 1

    ## taking_blood_pressure_medicine   taking_gallbladder_disease_medicine   smoking
    ## 0:249   0:385   0:572
    ## 1:590   1:454   1:267

    ## alcohol   breast_pain   radiation_history   Birth_control   menstrual_age
    ## 0:531   0:323   0:418   0:312   above 12:344
    ## 1:308   1:516   1:421   1:527   not yet : 25
    ## under 12:470

    ## menopausal_age   Benign_malignant_cancer   condition   treatment_age
    ## above 50: 37   Benign :335   death :424   Min. : 1.00
    ## not yet :744   Malignant:504   recovered :144   1st Qu.:28.00
    ## under 50: 56   under treatment:271   Median :33.00
    ## NA's : 2   Mean :35.16
    ## 3rd Qu.:40.00
    ## Max. :80.00
    ## NA's :2

| Before | After |
|---|---|
| Columns of mixed types | All columns treated as doubles |
| Values coded as 0, 1, 2 | Boolean variables |
| Column names containing `\r\n` | Clean names |
| Birth dates with 3 characters | Birth dates with 4 characters |
| Invalid blood type `44` | Valid blood types only |
| Implausible weight/age combinations | Patients under 18 years old removed |
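
Some of the cleaning steps in the table above could be sketched in base R roughly as follows. This is only an illustration, not the code used for this project. The toy data frame `raw`, the helper name `clean_patients`, and the list of accepted blood types are assumptions made for the example. The birth date repair is left out because the rule used to fix 3-character dates is not stated here.

```r
# Toy data standing in for the raw Kaggle table (all values are made up).
raw <- data.frame(
  age     = c(34, 12, 51, 28),
  smoking = c(0, 1, 1, 0),
  blood   = c("A+", "B-", "44", "O+"),
  stringsAsFactors = FALSE
)
# Mimic a header polluted with line-break characters.
names(raw)[3] <- "blood\r\n"

# Hypothetical helper: applies four of the cleaning steps from the table.
clean_patients <- function(df) {
  # Strip carriage returns and newlines from the column names.
  names(df) <- gsub("[\r\n]", "", names(df))
  # Recode the 0/1 indicator as a logical (Boolean) column.
  df$smoking <- df$smoking == 1
  # Keep only recognised ABO/Rh blood types (assumed list).
  valid_blood <- c("A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-")
  df <- df[df$blood %in% valid_blood, , drop = FALSE]
  # Drop patients younger than 18 (rows with a missing age are dropped too).
  df[!is.na(df$age) & df$age >= 18, , drop = FALSE]
}

cleaned <- clean_patients(raw)
print(cleaned)
```

Note that filtering with `%in%` also removes rows whose blood type is missing, which may or may not be desirable for the real data.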
For the statistical analysis, we included only women.
We created plots to explore the data and performed statistical analyses such as multiple correspondence analysis (MCA). The plots are shown in the next section, "Results".
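
For readers unfamiliar with MCA, a minimal sketch on a toy table of categorical variables is given below. It uses `mca()` from the MASS package, which is one of several R implementations and not necessarily the one used in this project. The data frame `toy` and its values are invented for illustration and only borrow column names from the dataset.

```r
library(MASS)  # provides mca(); MASS ships with standard R distributions

# Toy categorical data (made-up values); mca() requires every column to be a factor.
toy <- data.frame(
  smoking   = factor(c(0, 1, 0, 1, 1, 0, 1, 0)),
  alcohol   = factor(c(0, 1, 0, 0, 1, 0, 1, 1)),
  condition = factor(c("recovered", "death", "recovered", "under treatment",
                       "death", "recovered", "death", "under treatment"))
)

# Fit the MCA and keep the first two dimensions.
fit <- mca(toy, nf = 2)

# Coordinates of the patients (rows) and of the categories (factor levels).
row_coords <- fit$rs
cat_coords <- fit$cs
print(cat_coords)
```

Plotting `cat_coords` shows which categories tend to occur together: levels that lie close to each other in the two-dimensional map are shared by similar groups of patients.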
We have reached the following conclusions: